Descriptive Analysis

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey conducted annually by the CDC since 1984. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventive services. This project uses a CSV of the 2015 survey data available on Kaggle. The original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions asked directly of participants or variables calculated from individual participant responses.

Dataset Overview

Number of observations: 253,680

Number of features: 22

The target variable in this dataset is Diabetes_012, which categorizes individuals into three distinct groups:

0: No diabetes or only during pregnancy

1: Prediabetes

2: Diabetes

However, after exploratory data analysis, I merged classes 0 and 1 into a single class 0 and relabeled the original class 2 as class 1. The binary target is therefore:

0: No diabetes, only during pregnancy, or prediabetes

1: Diabetes
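The recoding described above could be sketched as follows. This is a minimal illustration on made-up toy values, not the project's actual cleaning script; the column name `isDiabetic` is taken from the modeling code later in this report, while the data frame `dia` and the exact recoding call are assumptions.

```r
# Hypothetical toy data standing in for the Diabetes_012 column
dia <- data.frame(Diabetes_012 = c(0, 1, 2, 0, 2, 1))

# Collapse original classes 0 and 1 into 0; original class 2 becomes 1
dia$isDiabetic <- factor(ifelse(dia$Diabetes_012 == 2, 1, 0), levels = c(0, 1))

# Cross-tabulate original vs. merged labels to confirm the mapping
mapping <- table(original = dia$Diabetes_012, merged = dia$isDiabetic)
print(mapping)
```

Storing the target as a factor keeps the two classes as labels rather than numbers, which is what classification functions generally expect.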

Dataset Features

The features in this dataset represent various health conditions, lifestyle choices, and demographic information. Below is a brief explanation of each feature:

HighBP: Binary indicator of whether the individual has been diagnosed with high blood pressure (1: Yes, 0: No).

HighChol: Binary indicator of whether the individual has been diagnosed with high cholesterol (1: Yes, 0: No).

CholCheck: Binary indicator of whether the individual has had a cholesterol check within the past five years (1: Yes, 0: No).

BMI: Body Mass Index, a continuous variable representing an individual’s weight in relation to their height.

Smoker: Binary indicator of whether the individual has smoked at least 100 cigarettes in their lifetime (1: Yes, 0: No).

Stroke: Binary indicator of whether the individual has been diagnosed with a stroke (1: Yes, 0: No).

HeartDisease: Binary indicator of whether the individual has been diagnosed with coronary heart disease or a myocardial infarction (1: Yes, 0: No).

PhysActivity: Binary indicator of whether the individual has engaged in any physical activity or exercise other than their regular job in the past 30 days (1: Yes, 0: No).

Fruits: Binary indicator of whether the individual consumes fruit at least once per day (1: Yes, 0: No).

Veg: Binary indicator of whether the individual consumes vegetables at least once per day (1: Yes, 0: No).

Alcohol: Binary indicator of heavy alcohol consumption (1: Yes, 0: No). Defined as more than 14 drinks per week for men and more than 7 drinks per week for women.

HealthCoverage: Binary indicator of whether the individual has any kind of health care coverage, including health insurance, prepaid plans, or government plans (1: Yes, 0: No).

CostDoc: Binary indicator of whether the individual could not see a doctor in the past 12 months due to cost (1: Yes, 0: No).

GenHealth: Self-reported general health status, ranging from 1 (Excellent) to 5 (Poor).

MentalHealth: Number of days in the past 30 days where the individual’s mental health was not good.

PhysicalHealth: Number of days in the past 30 days where the individual’s physical health was not good.

DiffWalk: Binary indicator of whether the individual has serious difficulty walking or climbing stairs (1: Yes, 0: No).

Sex: Binary indicator of the individual’s sex (1: Male, 0: Female).

Age: Age categorized into 13 levels ranging from 18-24 (Level 1) to 80+ (Level 13).

Education: Highest level of education completed, ranging from 1 (Never attended school) to 6 (College graduate).

Income: Household income from all sources, categorized into 8 levels (1: Less than $10,000 to 8: $75,000 or more).

This dataset provides a comprehensive snapshot of individuals’ health and lifestyle, which can be used to predict diabetes risk. Each feature is either binary, categorical, or continuous, representing different aspects of an individual’s health profile.

By understanding these features, we can begin to explore how they correlate with diabetes status and identify key predictors that can help improve early diagnosis and intervention.

Graphical Analysis

Bar Plot 1

knitr::include_graphics(
  here::here("output/bargraph.png")
)

The bar plot shows the distribution of the two diabetes classes in the dataset. The large majority of respondents, approximately 200,000, fall into the No Diabetes category (class 0), while approximately 40,000 respondents are categorized as having diabetes (class 1). The classes are therefore strongly imbalanced.
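To make the imbalance concrete, the sketch below turns the approximate counts quoted above into class shares. The counts are the rounded values read off the plot, not exact figures from the data, and the variable names are illustrative.

```r
# Approximate class counts as read off the bar plot (rounded, not exact)
class_counts <- c("0" = 200000, "1" = 40000)

# Share of each class among the counted respondents
class_share <- class_counts / sum(class_counts)
print(round(class_share, 3))

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(class_share)
print(round(baseline_accuracy, 3))
```

Any classifier trained on these data should be judged against this majority-class baseline rather than against 50%.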

Bar Plot 2

#| fig.align = "center",
#| out.width = "600px"
knitr::include_graphics(
  here::here("output/stackbargraph.png")
)

The stacked bar plot titled “High BP vs Diabetes” shows the relationship between diabetes status and blood pressure (BP) for the two diabetes classes (indicated by colors). The majority of non-diabetic respondents report normal BP, whereas more than half of diabetic respondents report high BP.
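The comparison shown in the plot amounts to a cross-tabulation with row-wise proportions. The sketch below uses a small made-up data frame (`toy`), so the printed shares illustrate the calculation only and are not results from the BRFSS data.

```r
# Hypothetical toy data: 6 non-diabetic and 4 diabetic respondents
toy <- data.frame(
  isDiabetic = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)),
  HighBP     = factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0))
)

# Counts of HighBP status within each diabetes class
counts <- table(isDiabetic = toy$isDiabetic, HighBP = toy$HighBP)

# Row proportions: share of each HighBP status within a diabetes class
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

Using `margin = 1` normalizes within each diabetes class, which matches the question the plot answers: what fraction of each class reports high blood pressure.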

Scatterplot

#| fig.align = "center",
#| out.width = "600px"
knitr::include_graphics(
  here::here("output/scatterplot.png")
)

The scatter plot named “Age vs BMI by Diabetes Status” shows the relationship between an individual’s age (X-axis) and their Body Mass Index (BMI) (Y-axis) across two diabetes classes (indicated by colors).

The majority of individuals, represented by the light blue points, belong to Class 0 (no diabetes or prediabetes). These points are densely packed across the BMI range and are most concentrated at age levels 5 to 10 (these are age category codes, not years).

Individuals in Class 1 (diabetes), represented by light coral points, tend to cluster at higher BMIs, with some dispersion across the age groups. There is a subtle pattern suggesting that those with diabetes tend to have higher BMI values, though the two classes overlap considerably.

Regression Analysis

Multinomial Logistic Regression

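library(caret)  # provides createDataPartition()

# Stratified 80/20 train/test split on the binary target; the seed makes it reproducible
set.seed(123)
trainIndex <- createDataPartition(dia_clean$isDiabetic, p = .8, 
                                  list = FALSE, 
                                  times = 1)
diaTrain <- dia_clean[ trainIndex,]
diaTest  <- dia_clean[-trainIndex,]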

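library(nnet)  # loads the summary() method for multinom objects

# Load the previously fitted model instead of refitting it
multinomial_regression <- readRDS(
  file = here::here("output/multinomial.rds")
)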

summary(multinomial_regression)
## Call:
## multinom(formula = isDiabetic ~ ., data = diaTrain)
## 
## Coefficients:
##                             Values    Std. Err.
## (Intercept)           -7.722302761 0.2567608134
## HighBP1                0.662853443 0.0155710509
## HighChol1              0.540971391 0.0145507406
## CholCheck1             1.230363506 0.0691779043
## BMI                    0.057168471 0.0009921458
## Smoker1               -0.055359300 0.0142555982
## Stroke1                0.134053555 0.0272016853
## HeartDiseaseorAttack1  0.220324493 0.0193793386
## PhysActivity1         -0.037743585 0.0154037734
## Fruits1                0.006203939 0.0146886625
## Veggies1              -0.012815521 0.0170161579
## HvyAlcoholConsump1    -0.708368640 0.0388879538
## AnyHealthcare1         0.062262005 0.0352434975
## NoDocbcCost1           0.052201011 0.0244095826
## GenHlth2               0.622205083 0.0343627833
## GenHlth3               1.222431280 0.0334957255
## GenHlth4               1.653050941 0.0365066411
## GenHlth5               1.818095143 0.0443795116
## MentHlth              -0.001779914 0.0009064229
## PhysHlth              -0.003700485 0.0008661651
## DiffWalk1              0.130941976 0.0182194961
## Sex1                   0.249830361 0.0144662170
## Age2                   0.216300836 0.1449213372
## Age3                   0.472275909 0.1314129102
## Age4                   0.920834961 0.1245435999
## Age5                   1.106074223 0.1222233924
## Age6                   1.347707073 0.1202772037
## Age7                   1.536671775 0.1190776507
## Age8                   1.635978556 0.1186517406
## Age9                   1.847995942 0.1183861309
## Age10                  2.010157808 0.1184201950
## Age11                  2.051068260 0.1188681798
## Age12                  1.967383014 0.1196157761
## Age13                  1.781900091 0.1197902028
## Education2             0.144434714 0.2151639424
## Education3             0.033730003 0.2130554153
## Education4            -0.082413119 0.2115927572
## Education5            -0.045164070 0.2116732712
## Education6            -0.095432518 0.2117895916
## Income2               -0.051450411 0.0383152463
## Income3               -0.076347216 0.0368557530
## Income4               -0.119727225 0.0362101660
## Income5               -0.175795904 0.0356408360
## Income6               -0.257582302 0.0350978774
## Income7               -0.269465210 0.0354482989
## Income8               -0.378988566 0.0350996423
## 
## Residual Deviance: 136029.5 
## AIC: 136121.5

We fit a multinomial logistic regression model on the training data. Because the target has only two classes after merging, the model reduces to an ordinary binary logistic regression: it estimates the probability that an individual is in Class 1 (diabetes) rather than Class 0 from the BRFSS health indicators.

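Each coefficient is the estimated change in the log-odds of being in Class 1 rather than Class 0 (the reference class), holding the other predictors constant. For a numeric predictor such as BMI, it is the change per one-unit increase; for a categorical predictor, it is the change relative to that predictor's reference level.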
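As a rough illustration of what a log-odds scale means in practice, the sketch below converts a linear predictor into a probability with the logistic function. The coefficients are copied from the summary output above; the individual (HighBP = 1, BMI = 30, every categorical predictor at its reference level, and 0 days for MentHlth and PhysHlth) is a hypothetical profile chosen only for the example.

```r
# Coefficients copied from the summary output above
b_intercept <- -7.722302761
b_highbp    <-  0.662853443
b_bmi       <-  0.057168471

# Hypothetical profile: HighBP = 1, BMI = 30, all other terms contribute 0
eta_highbp <- b_intercept + b_highbp * 1 + b_bmi * 30
eta_normal <- b_intercept + b_highbp * 0 + b_bmi * 30

# plogis() is the inverse logit: it maps log-odds to a probability
p_highbp <- plogis(eta_highbp)
p_normal <- plogis(eta_normal)
print(c(p_highbp = p_highbp, p_normal = p_normal))

# The gap between the two profiles on the log-odds scale is the HighBP coefficient
print(eta_highbp - eta_normal)
```

The probabilities are small because the reference levels describe the youngest age group in excellent general health; the coefficient itself does not depend on which profile is chosen.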

For example, the coefficient for HighBP (high blood pressure) is 0.6629, so having high blood pressure raises the log-odds of being diabetic (Class 1) rather than non-diabetic (Class 0); on the odds scale this is a multiplication by about exp(0.6629) ≈ 1.94. Similarly, the BMI coefficient is 0.0572, meaning each additional BMI unit multiplies the odds of diabetes by about exp(0.0572) ≈ 1.06.
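The summary output lists estimates and standard errors but no test statistics, so one way to gauge uncertainty is to compute Wald z-values and 95% confidence intervals by hand. The sketch below does this for four terms copied from the output above; the choice of terms and the data frame `coefs` are illustrative, and the normal approximation is an assumption of the Wald approach.

```r
# Estimates and standard errors copied from the summary output above
coefs <- data.frame(
  term      = c("HighBP1", "BMI", "Fruits1", "HvyAlcoholConsump1"),
  estimate  = c(0.662853443, 0.057168471, 0.006203939, -0.708368640),
  std_error = c(0.0155710509, 0.0009921458, 0.0146886625, 0.0388879538)
)

# Odds ratios with 95% Wald confidence intervals
z_crit <- qnorm(0.975)
coefs$odds_ratio <- exp(coefs$estimate)
coefs$or_low     <- exp(coefs$estimate - z_crit * coefs$std_error)
coefs$or_high    <- exp(coefs$estimate + z_crit * coefs$std_error)

# Wald z statistic and two-sided p-value under a normal approximation
coefs$z_value <- coefs$estimate / coefs$std_error
coefs$p_value <- 2 * pnorm(-abs(coefs$z_value))

print(coefs, digits = 4)
```

An interval that contains 1 on the odds-ratio scale, as for Fruits1 here, means the data do not clearly separate that effect from no effect.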

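The model successfully converged after 60 iterations, achieving a final residual deviance of 136029.5 and an AIC (Akaike Information Criterion) of 136121.5. AIC is a relative measure: among models fit to the same data, the one with the lower AIC offers the better trade-off between goodness of fit and model complexity.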
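The reported AIC can be checked against the residual deviance, since AIC equals the deviance plus twice the number of estimated parameters. The parameter count of 46 below is obtained by counting the coefficient rows (including the intercept) in the summary output above; the variable names are illustrative.

```r
# Values copied from the summary output above
residual_deviance <- 136029.5
n_params <- 46  # coefficient rows in the summary, including the intercept

# AIC = deviance + 2 * number of estimated parameters
aic <- residual_deviance + 2 * n_params
print(aic)  # 136121.5, matching the reported AIC
```

The penalty term, 2 × 46 = 92, is tiny relative to the deviance, so for a dataset of this size AIC comparisons between candidate models will be driven almost entirely by differences in fit.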